Add more training models and RLHF algorithms #6368

sglucas · 2025-07-23T02:04:40Z

📌 Checklist before creating the PR

I have created an issue for this PR for traceability
The title follows the standard format: [doc/gemini/tensor/...]: A concise description
I have added relevant tags if possible for us to better distinguish different PRs
I have installed pre-commit: pip install pre-commit && pre-commit install

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge

e.g. fixed #1234, closed #1234, resolved #1234

📝 What does this PR do?

Summarize your work here.
if you have any plots/diagrams/screenshots/tables, please attach them here.

Add more training models (LLaMA3, Qwen3) and RLHF algorithms (REINFORCE++, RLOO).

💥 Checklist before requesting a review

I have linked my PR to an issue (instruction)
My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
I have performed a self-review of my code
I have added thorough tests.
I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

🌝 Yes, I do.
🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.


YeAnbang · 2025-08-04T01:52:59Z

applications/ColossalChat/coati/distributed/grpo_consumer.py

-        # [minibatch_size x num_of_generation]
-        loss_mask = torch.ones(action_mask.size(0), device=action_mask.device).bool()
+            # [minibatch_size x num_of_generation]
+            loss_mask = torch.ones(action_mask.size(0), device=action_mask.device).bool()



It may be better to move the common calculations outside the if statements for conciseness.

YeAnbang · 2025-08-04T02:33:19Z

applications/ColossalChat/coati/distributed/grpo_consumer.py

+            # [minibatch_size x num_generations]
+            advantages = ((reward - reward_mean)).unsqueeze(dim=-1)
+
+            advantages_mean = advantages.mean(dim=0)


Isn't advantages_mean always 0, since the advantages are already zero-centered in the previous step?
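A minimal sketch of the point raised here, assuming reward_mean is the per-prompt mean over that prompt's generations (the hunk does not show how it is computed). It uses plain Python lists instead of tensors, and the reward values are made up for illustration. If each prompt's rewards are centered on their own mean, every group sums to zero, so the mean over the whole batch is zero as well.

```python
from statistics import mean

# Hypothetical rewards: 2 prompts x 3 generations each (values are made up).
rewards = [[1.0, 0.0, 2.0], [0.5, 0.5, 2.0]]

# Step 1: subtract each prompt's own mean reward (per-prompt centering).
centered = [[r - mean(group) for r in group] for group in rewards]

# Step 2: take the mean over the flattened batch, which plays the role of
# advantages.mean(dim=0) in the hunk above.
flat = [a for group in centered for a in group]
batch_mean = mean(flat)

print(batch_mean)
```

Under that assumption, subtracting advantages_mean in the next step changes nothing, and only the division by the standard deviation has an effect.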

YeAnbang · 2025-08-04T02:45:20Z

applications/ColossalChat/coati/distributed/grpo_consumer.py

+            advantages_std = advantages.std(dim=0)
+
+            advantages = (advantages - advantages_mean) / (advantages_std + 1e-4)
+


Consider double-checking the REINFORCE++ baseline advantage calculation. In REINFORCE++, each sample's advantage is calculated by subtracting the mean reward of all generations in the global batch, not the per-prompt mean.
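To make the difference concrete, here is a rough sketch that follows the description in this comment rather than any reference implementation. The function names and reward values are hypothetical, plain lists stand in for tensors, and the 1e-4 epsilon is copied from the hunk above. The sample standard deviation (`statistics.stdev`) is used on the assumption that it matches the default behavior of `Tensor.std`.

```python
from statistics import mean, stdev


def per_prompt_centered(rewards):
    """Subtract each prompt's own mean reward (per-prompt baseline)."""
    return [[r - mean(group) for r in group] for group in rewards]


def global_batch_advantages(rewards, eps=1e-4):
    """Center and scale every reward with statistics of the whole batch."""
    flat = [r for group in rewards for r in group]
    mu = mean(flat)      # one baseline shared by all samples
    sigma = stdev(flat)  # sample standard deviation over the batch
    return [[(r - mu) / (sigma + eps) for r in group] for group in rewards]


# Hypothetical rewards: prompt 0 is hard (low rewards), prompt 1 is easy.
rewards = [[1.0, 3.0], [5.0, 7.0]]

local_adv = per_prompt_centered(rewards)
global_adv = global_batch_advantages(rewards)

print(local_adv)
print(global_adv)
```

With the per-prompt baseline, the better generation of the hard prompt gets a positive advantage. With the global baseline it stays negative, because its reward is still below the batch mean. The two baselines therefore produce different training signals, which is why the choice matters here.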

YeAnbang · 2025-08-04T02:46:35Z

applications/ColossalChat/coati/distributed/untitled.txt

@@ -0,0 +1,2 @@
+4.51.0: qwen2.5 + grpo, qwen3 + grpo, cannot: llama2, llama3.2
+4.47.0:


Please remove this test log file.

YeAnbang · 2025-08-04T02:47:08Z

applications/ColossalChat/rl_example.py

@@ -227,13 +227,13 @@
    os.environ["TOKENIZERS_PARALLELISM"] = "false"  # Disable tokenizers parallelism to avoid deadlock

    inference_model_config = dict(path=args.model)
-    train_model_config = dict(path=args.model, use_flash_attention_2=True, use_cache=False)
+    train_model_config = dict(path=args.model, use_flash_attention_2=False, use_cache=False)


Why is flash attention not supported?

YeAnbang · 2025-08-04T02:47:25Z

applications/ColossalChat/rl_example.py

    generate_config = dict(top_k=args.top_k, top_p=args.top_p, temperature=args.temperature)

    if args.backend == "transformers":
        inference_model_config.update(
            dict(
-                use_flash_attention_2=True,
+                use_flash_attention_2=False,


YeAnbang · 2025-08-04T02:49:28Z

applications/ColossalChat/rl_example.py

Probably also consider forcing num_generation to 1 for REINFORCE++.
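One possible way to enforce this, sketched under assumptions: the helper name `resolve_num_generations` and the algorithm string "reinforce++" are hypothetical and are not taken from rl_example.py, so the actual argument names and algorithm identifiers would need to be substituted.

```python
import warnings


def resolve_num_generations(algo: str, num_generations: int) -> int:
    """Return the number of generations per prompt to use for `algo`.

    Hypothetical helper: overrides the requested value with 1 when the
    algorithm is REINFORCE++ and warns the user about the override.
    """
    if algo == "reinforce++" and num_generations != 1:
        warnings.warn(
            f"num_generations={num_generations} overridden to 1 for {algo}"
        )
        return 1
    return num_generations


# Example usage with made-up argument values.
print(resolve_num_generations("reinforce++", 8))
print(resolve_num_generations("grpo", 8))
```

Warning instead of raising keeps existing launch scripts working. Raising a ValueError would be the stricter alternative if silently changing a user-supplied value is not acceptable.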

root and others added 3 commits July 21, 2025 10:02

Merge upstream/grpo-latest and reapply my local changes

47ee955

Merge upstream/grpo-latest and reapply my local changes

77bd4a4

Merge upstream/grpo-latest and reapply my local changes

dd08277

sglucas requested a review from a team as a code owner July 23, 2025 02:04

sglucas changed the base branch from main to grpo-latest July 23, 2025 02:05

[pre-commit.ci] auto fixes from pre-commit.com hooks

9da096f

for more information, see https://pre-commit.ci
